Recovery in Massively Parallel Systems
Abstract
The objective of ESPRIT project 6731 FTMPS [1] is to develop techniques and system software to integrate fault tolerance into massively parallel systems [2]. This covers the whole range from error detection, through fault diagnosis and fault isolation, to system and application recovery. The project emphasizes both research into applicability in massively parallel systems and the development of system software that may be commercialized in future products. The project partners are: Parsytec Computer GmbH (D), British Aerospace Ltd. (UK), Katholieke Universiteit Leuven (B), Universität-GH Paderborn (D) (recently replaced by the Medizinische Universität zu Lübeck), Universität Erlangen-Nürnberg (D) and Universidade de Coimbra (P). Although Parsytec systems (the PowerXplorer is one of them) served as the development hardware, the developed methodologies and implementations have been kept as hardware independent as possible.
Similar resources
Facing up to the Inevitable: Intelligent Error Recovery in Massively Parallel Processing in Memory Architectures
Massively parallel “Processing-In-Memory” (PIM) architectures have been shown to yield increases in performance due to their “memory-centric” nature. However, as PIM is still a developing technology, advanced issues such as error detection and failure recovery have not yet been addressed. We describe the application of concepts found in our multi-agent system, ADE, to PIM, incorporating its mec...
Massively Parallel Execution Model and Massively Parallel Architecture
The purposes for the research and development of the RWC massively parallel computer project are (1) to efficiently support flexible and integrated computation, which are research targets in the RWC Project, (2) to pursue a general-purpose massively parallel system efficiently supporting multiple programming paradigms, and (3) to realize a stand-alone system which has a mature operating system. For th...
A massively parallel strategy for STR marker development, capture, and genotyping
Short tandem repeat (STR) variants are highly polymorphic markers that facilitate powerful population genetic analyses. STRs are especially valuable in conservation and ecological genetic research, yielding detailed information on population structure and short-term demographic fluctuations. Massively parallel sequencing has not previously been leveraged for scalable, efficient STR recovery. He...
A User-triggered Checkpointing Library for Computation-intensive Applications
We propose a method to incorporate coordinated checkpointing and rollback in high performance computing applications on massively parallel computers. A library allows the user to specify which data-items (including files) belong to the contents of the checkpoint, and to trigger the checkpointing in the application. The recovery-line management on the distributed disk system takes care of which ...